
Conversation

@tianyu-l (Contributor) commented Aug 6, 2025

Given the complexity of the MoE and EP modules, this PR:

  1. creates torchtitan/models/moe.py as the central MoE implementation (this is similar to why we have torchtitan/models/attention.py)
  2. creates torchtitan/distributed/expert_parallel.py as the central EP implementation
  3. renames torchtitan/distributed/pipeline.py -> torchtitan/distributed/pipeline_parallel.py to be consistent with EP
  4. applies a temporary fix by @rakkit for possible memory leaking of DP2EP with recompute (#1467), until the memory leak issue with AC + PT-D all_to_all_single_autograd is fixed (cc @soulitzer) -- a sketch of this kind of workaround is shown below
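For reference, a minimal sketch (not the exact PR code) of the kind of temporary workaround item 4 refers to: a hand-written autograd.Function around torch.distributed.all_to_all_single, used in place of the functional-collectives all_to_all_single_autograd until the AC memory leak is resolved. Apart from the forward signature visible in the review diff further down, all names and details here are illustrative.

    import torch
    import torch.distributed as dist

    class _AllToAllSingle(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, out_splits, in_splits, group):
            if isinstance(out_splits, torch.Tensor):
                out_splits = out_splits.tolist()  # d2h sync, see the review thread
            if isinstance(in_splits, torch.Tensor):
                in_splits = in_splits.tolist()
            ctx.group = group
            ctx.out_splits = out_splits
            ctx.in_splits = in_splits
            out = x.new_empty((sum(out_splits),) + x.shape[1:])
            dist.all_to_all_single(out, x.contiguous(), out_splits, in_splits, group=group)
            return out

        @staticmethod
        def backward(ctx, grad_out):
            # the backward of an all-to-all is an all-to-all with the splits swapped
            grad_in = grad_out.new_empty((sum(ctx.in_splits),) + grad_out.shape[1:])
            dist.all_to_all_single(
                grad_in, grad_out.contiguous(), ctx.in_splits, ctx.out_splits, group=ctx.group
            )
            return grad_in, None, None, None

    # usage (ep_group being an initialized process group):
    # routed = _AllToAllSingle.apply(x, out_splits, in_splits, ep_group)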

@meta-cla bot added the CLA Signed label Aug 6, 2025
@tianyu-l tianyu-l linked an issue (Circular imports) Aug 6, 2025 that may be closed by this pull request
@danielvegamyhre (Contributor) left a comment:

lgtm, left a couple of minor comments/questions.

    import torch.nn as nn
    from torch.distributed._functional_collectives import all_to_all_single_autograd

    # from torch.distributed._functional_collectives import all_to_all_single_autograd
nit: remove commented code

@tianyu-l (Contributor, Author) Aug 6, 2025:

This is intentional -- we should restore this implementation after the bug is fixed. I reorganized the code a bit to make it clearer.

    @staticmethod
    def forward(ctx, x, out_splits, in_splits, group):
        if isinstance(out_splits, torch.Tensor):
            out_splits = out_splits.tolist()
won't tolist() cause d2h sync? is this okay / intentional in this case?

@tianyu-l (Contributor, Author) Aug 6, 2025:

It will. This is a temporary fix, but currently in EP there are multiple places with d2h sync. I'm working on another implementation to kill them.
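For context, a minimal standalone illustration (not PR code, and assuming a CUDA device is available) of why .tolist() on a GPU tensor is a device-to-host sync point:

    import torch

    splits = torch.tensor([4, 2, 6], device="cuda")  # typically produced by GPU work
    # .tolist() copies the tensor to the host and blocks the CPU until the GPU
    # work producing `splits` has finished -- a d2h synchronization.
    out_splits = splits.tolist()
    print(out_splits)  # [4, 2, 6]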

@ruisizhang123 (Contributor) left a comment:
LGTM, thank you for the refactor.

@tianyu-l tianyu-l force-pushed the cleanup branch 2 times, most recently from 85dc2ad to 16ad9f5 on August 6, 2025 02:44

    @dataclass
    class MoEArgs:
        moe_enabled: bool = True
Why do we need to have moe_enabled in MoEArgs?

I didn't see anywhere that this parameter is set to false.

@tianyu-l (Contributor, Author):
makes sense, removed
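Purely for illustration (the field names below are hypothetical placeholders, not necessarily torchtitan's), one way the dataclass could look after dropping the always-true flag, with "is this layer MoE?" signaled by whether a MoEArgs instance is supplied at all:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MoEArgs:
        num_experts: int = 8   # hypothetical field
        top_k: int = 1         # hypothetical field

    @dataclass
    class ModelArgs:           # hypothetical container
        dim: int = 4096
        moe_args: Optional[MoEArgs] = None  # None => dense FFN, otherwise MoE layer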

@wwwjn (Contributor) left a comment:
LGTM! Nice refactor

@rakkit (Contributor) commented Aug 6, 2025

(sry, I am leaving for vacation and too lazy to open a PR.)
Since you are refactoring MoE, here is another trick/bug we found. When we use DP2EP, the gradient-reduction denominator of EP is always smaller than that of other modules (e.g. attention or embedding), so the actual gradients of EP are always larger. See the logs below.

[screenshot: gradient-norm logs]

We fixed this by adding a loss_average_denominator in parallel_dims.py and forcing EP's reduce denominator via

            transformer_block.feed_forward.experts.set_reduce_scatter_divide_factor(
                loss_average_denominator,
            )

in apply_fsdp; you can also check the full code here

where loss_average_denominator = dp_replicate * dp_shard * cp (see here).
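Sketched end to end (anything not named in the comment above -- e.g. the parallel_dims attributes and the transformer_block variable -- is a placeholder, not necessarily the real code), the fix amounts to:

    # inside apply_fsdp, after fully_shard has wrapped the experts module
    loss_average_denominator = (
        parallel_dims.dp_replicate * parallel_dims.dp_shard * parallel_dims.cp
    )
    transformer_block.feed_forward.experts.set_reduce_scatter_divide_factor(
        loss_average_denominator,
    )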


We also have a better version of the bias update that only needs one reduce; you can check the code here. One gotcha that does not affect the bias update itself, but is important to know once we have activation checkpointing:

        if self.load_balance_coeff is not None:
            with torch.no_grad():
                self.tokens_per_expert.add_(num_tokens_per_expert)

This snippet will be called more than once (the forward runs again during recompute), so the accumulated value of num_tokens_per_expert ends up 2x what it should be.
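A tiny standalone repro of that effect (not torchtitan code): a side-effecting counter inside a checkpointed function gets bumped both in the original forward and in the recompute triggered by backward.

    import torch
    from torch.utils.checkpoint import checkpoint

    counter = torch.zeros(1)

    def ffn(x):
        # stand-in for the tokens_per_expert accumulation inside the MoE layer
        with torch.no_grad():
            counter.add_(1)
        return x * 2

    x = torch.randn(4, requires_grad=True)
    y = checkpoint(ffn, x, use_reentrant=False)
    y.sum().backward()
    print(counter)  # tensor([2.]) -- the update ran twice, doubling the stats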

@tianyu-l tianyu-l merged commit a9aa506 into main Aug 6, 2025
7 checks passed
@tianyu-l tianyu-l deleted the cleanup branch August 6, 2025 05:47
@tianyu-l tianyu-l mentioned this pull request Aug 6, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
tianyu-l added a commit that referenced this pull request Aug 11, 2025
@garrett361 (Contributor) commented:

Nice @rakkit, we found the same issue with the ep grads being off by a factor. I was finding that set_reduce_scatter_divide_factor errored when using an mp policy, though.

Surprised you didn't hit that? Think I saw you're on torch==2.6 in another comment elsewhere

@rakkit (Contributor) commented Aug 13, 2025

lol @garrett361 thanks for the info. I did not see the issue on either Torch 2.6 or 2.7.1.

To clarify, I only tested the default mp setting (mixed_precision_param=bf16 and mixed_precision_reduce=fp32).
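For reference, a minimal sketch of that default setting expressed with FSDP2's MixedPrecisionPolicy (the surrounding fully_shard wiring and the experts module are omitted placeholders, not the exact torchtitan code):

    import torch
    from torch.distributed.fsdp import MixedPrecisionPolicy

    # bf16 parameters for compute, fp32 for the gradient reduce-scatter
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.float32,
    )
    # this policy would be passed as fully_shard(module, mp_policy=mp_policy)
    # before calling set_reduce_scatter_divide_factor on the sharded experts,
    # which is the combination reported above to error on some setups.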
